On ergodic two-armed bandits
Authors
Abstract
A device has two arms with unknown deterministic payoffs, and the aim is to asymptotically identify the best one without spending too much time on the other. The Narendra algorithm offers a stochastic procedure to this end. We show under weak ergodic assumptions on these deterministic payoffs that the procedure eventually chooses the best arm (i.e. the one with the greatest Cesàro limit) with probability one, for appropriate step sequences of the algorithm. In the case of i.i.d. payoffs, this implies a "quenched" version of the "annealed" result of Lamberton, Pagès and Tarrès (2004) [6] by the law of the iterated logarithm, thus generalizing it. More precisely, if $(\eta_{l,i})_{i\in\mathbb{N}} \in \{0,1\}^{\mathbb{N}}$, $l \in \{A,B\}$, are the deterministic reward sequences we would obtain if we played arm $l$ at time $i$, we obtain infallibility under the same assumption on nonincreasing step sequences as in [6], replacing the i.i.d. assumption on the payoffs by the hypothesis that the empirical averages $\sum_{i=1}^{n} \eta_{A,i}/n$ and $\sum_{i=1}^{n} \eta_{B,i}/n$ converge, as $n$ tends to infinity, to $\theta_A$ and $\theta_B$ respectively, at rate at least $1/(\log n)^{1+\varepsilon}$ for some $\varepsilon > 0$.

University of Oxford, Mathematical Institute, 24-29 St Giles, Oxford OX1 3LB, United Kingdom, [email protected]. Supported in part by the Swiss National Foundation Grant 200021-1036251/1, and by a Leverhulme Prize.

Université Paris-Est Marne-la-Vallée, LAMA, 5 boulevard Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, [email protected].

AMS 2000 Subject Classifications. Primary 62L20; secondary 93C40, 91E40, 68T05, 91B32.
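The procedure can be made concrete with a short simulation. Below is a minimal sketch, assuming the classical linear reward-inaction update studied in [6]: the probability x of playing arm A is pushed toward 1 by a unit payoff on A and toward 0 by a unit payoff on B, with nonincreasing steps. The function name, the step sequence 1/(n+10), and the Bernoulli-generated payoff sequences are illustrative choices, not taken from the paper.

```python
import random

def narendra_two_armed(eta_A, eta_B, gamma, x0=0.5):
    """Simulate a Narendra-type two-armed bandit on deterministic
    {0,1} reward sequences eta_A, eta_B with step sequence gamma.

    At time n we play arm A with probability x; a reward of 1 on the
    played arm pushes x toward that arm, a reward of 0 leaves x
    unchanged (linear reward-inaction scheme).
    """
    x = x0
    for rA, rB, g in zip(eta_A, eta_B, gamma):
        if random.random() < x:          # play arm A
            if rA == 1:
                x = x + g * (1.0 - x)    # reinforce A on success
        else:                            # play arm B
            if rB == 1:
                x = x - g * x            # reinforce B on success
    return x  # close to 1 iff the algorithm settled on arm A

# Example: arm A's payoffs have Cesaro average 0.7, arm B's have 0.4.
N = 100_000
eta_A = [1 if random.random() < 0.7 else 0 for _ in range(N)]
eta_B = [1 if random.random() < 0.4 else 0 for _ in range(N)]
gamma = [1.0 / (n + 10) for n in range(N)]  # a nonincreasing step sequence
print(narendra_two_armed(eta_A, eta_B, gamma))  # typically close to 1
```

Infallibility in the abstract's sense corresponds to the returned value converging to 1 (the better arm A) with probability one, rather than merely on average over runs.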
Similar resources
Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...
Modal Bandits
Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.
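As a toy illustration of what a mode-valued objective changes (an assumed epsilon-greedy rule with illustrative names, not the paper's algorithm), consider two arms with equal expected reward but different modal rewards:

```python
import random
from collections import Counter

def empirical_mode(samples):
    """Most frequent observed reward (ties broken arbitrarily)."""
    return Counter(samples).most_common(1)[0][0]

def epsilon_greedy_modal(arms, rounds=10_000, eps=0.1):
    """Toy epsilon-greedy bandit that values an arm by the mode of its
    reward samples instead of its mean. `arms` is a list of
    zero-argument callables returning a discrete reward."""
    history = [[arm()] for arm in arms]  # one forced pull per arm
    for _ in range(rounds):
        if random.random() < eps:        # explore uniformly
            k = random.randrange(len(arms))
        else:                            # exploit: highest modal reward
            k = max(range(len(arms)), key=lambda j: empirical_mode(history[j]))
        history[k].append(arms[k]())
    return max(range(len(arms)), key=lambda j: empirical_mode(history[j]))

# Equal means, different modes: arm 0 always pays 2; arm 1 usually
# pays 0 with a rare payoff of 20 (mean also 2, mode 0).
arm0 = lambda: 2
arm1 = lambda: 0 if random.random() < 0.9 else 20
print(epsilon_greedy_modal([arm0, arm1]))  # selects arm 0 (mode 2 > mode 0)
```

A mean-based rule would be indifferent between these two arms; the mode-based value separates them.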
Strategic Exit with Random Observations
In the standard optimal stopping problems, actions are artificially restricted to the moments of observations of costs or benefits. In the standard experimentation and learning models based on two-armed Poisson bandits, it is possible to take an action between two sequential observations. The latter models do not recognize the fact that timing of decisions depends not only on the rate of arriva...
Sequential Monte Carlo Bandits
In this paper we propose a flexible and efficient framework for handling multi-armed bandits, combining sequential Monte Carlo algorithms with hierarchical Bayesian modeling techniques. The framework naturally encompasses restless bandits, contextual bandits, and other bandit variants under a single inferential model. Despite the model’s generality, we propose efficient Monte Carlo algorithms t...
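As a rough illustration of the particle-based idea (a toy sketch only; the paper's priors, models and resampling scheme are not specified in this snippet), one can run Thompson sampling for Bernoulli arms with a weighted particle cloud per arm:

```python
import random

def particle_thompson(arms, rounds=5000, n_particles=200):
    """Toy particle-filter Thompson sampling for Bernoulli arms.

    Each arm keeps a weighted particle cloud over its success
    probability; we act by sampling one particle per arm (Thompson
    step) and reweight the played arm's cloud by the Bernoulli
    likelihood of the observed reward."""
    particles = [[random.random() for _ in range(n_particles)] for _ in arms]
    weights = [[1.0 / n_particles] * n_particles for _ in arms]
    for _ in range(rounds):
        # Thompson step: draw one particle per arm, play the best draw.
        draws = [random.choices(particles[j], weights[j])[0]
                 for j in range(len(arms))]
        k = max(range(len(arms)), key=lambda j: draws[j])
        r = arms[k]()  # observe a 0/1 reward
        # Reweight arm k's particles by the Bernoulli likelihood.
        w = [wi * (p if r == 1 else 1.0 - p)
             for wi, p in zip(weights[k], particles[k])]
        total = sum(w)
        weights[k] = [wi / total for wi in w]
        # Resample when the cloud degenerates (low effective sample size).
        ess = 1.0 / sum(wi * wi for wi in weights[k])
        if ess < n_particles / 2:
            particles[k] = random.choices(particles[k], weights[k], k=n_particles)
            weights[k] = [1.0 / n_particles] * n_particles
    return particles

arms = [lambda: int(random.random() < 0.3), lambda: int(random.random() < 0.6)]
clouds = particle_thompson(arms)
print(sum(clouds[1]) / len(clouds[1]))  # mean of the better arm's cloud, near 0.6
```

For Bernoulli rewards a conjugate Beta posterior would suffice; the particle representation is the piece that generalizes to the non-conjugate and restless settings the abstract mentions.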
A Survey on Contextual Multi-armed Bandits
4 Stochastic Contextual Bandits; 4.1 Stochastic Contextual Bandits with Linear Realizability Assumption; 4.1.1 LinUCB/SupLinUCB; 4.1.2 LinREL/SupLinREL; 4.1.3 CofineUCB; 4.1.4 Thompson Sampling with Linear Payoffs...
Reducing Dueling Bandits to Cardinal Bandits
We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form “A is preferred to B” (as opposed to cardinal feedback like “A has value 2.5”), giving it wide applicability in learning from implicit user feedback and revealed and stated prefer...